Semantic Equivariant Mixup
Mixup is a well-established data augmentation technique that extends the
training distribution and regularizes neural networks by creating ''mixed''
samples under the label-equivariance assumption, i.e., a proportional mixup
of the input data results in the corresponding labels being mixed in the same
proportion. However, previous mixup variants may fail to exploit the
label-independent information in mixed samples during training, which usually
contains richer semantic information. To further unleash the power of mixup, we
first strengthen the previous label-equivariance assumption into the
semantic-equivariance assumption, which states that a proportional mixup of
the input data should lead to the corresponding representations being mixed in
the same proportion. We then propose a generic mixup regularization at the
representation level, which further regularizes the model with the semantic
information in mixed samples. At a high level, the proposed semantic
equivariant mixup (SEM) encourages the structure of the input data to be
preserved in the representation space, i.e., a change in the input results in
a corresponding change in the obtained representation. Unlike previous mixup
variants, which tend to over-focus on label-related information, the proposed
method aims to preserve the richer semantic information in the input via the
semantic-equivariance assumption, thereby improving the robustness of the
model against distribution shifts. We conduct extensive empirical studies and
qualitative analyses to demonstrate the effectiveness of our proposed method.
The code of the manuscript is in the supplement.
Comment: Under review
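The semantic-equivariance assumption above has a direct algebraic reading: encode the mixed input, and compare it with the mixture of the two encodings. A minimal NumPy sketch follows; the squared-L2 penalty, the encoder `f`, and all values are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def mixup(x1, x2, y1, y2, lam):
    """Classic mixup under the label-equivariance assumption:
    inputs and labels are interpolated with the same coefficient."""
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

def semantic_equivariance_loss(f, x1, x2, lam):
    """Representation-level penalty in the spirit of the
    semantic-equivariance assumption: the representation of the mixed
    input should equal the mixture of the two representations.
    (Squared L2 distance is an illustrative choice, not the paper's.)"""
    z_mixed = f(lam * x1 + (1 - lam) * x2)      # encode the mixed input
    z_target = lam * f(x1) + (1 - lam) * f(x2)  # mix the encodings
    return float(np.mean((z_mixed - z_target) ** 2))
```

A linear encoder satisfies the assumption exactly (zero penalty); for a nonlinear encoder the penalty nudges the representation space toward preserving the mixing structure of the inputs.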
Exploring and Exploiting Uncertainty for Incomplete Multi-View Classification
Classifying incomplete multi-view data is unavoidable, since arbitrarily
missing views are widespread in real-world applications. Although great
progress has been achieved, existing incomplete multi-view methods still
struggle to obtain trustworthy predictions due to the inherently high
uncertainty of missing views. First, a missing view is highly uncertain, so
it is not reasonable to provide a single deterministic imputation. Second, the
quality of the imputed data itself is highly uncertain. To explore and
exploit this uncertainty, we propose an Uncertainty-induced Incomplete
Multi-View Data Classification (UIMC) model to classify incomplete
multi-view data within a stable and reliable framework. We construct a
distribution and sample from it multiple times to characterize the uncertainty
of missing views, and adaptively utilize the samples according to their
quality. Accordingly, the proposed method realizes more perceivable imputation
and controllable fusion. Specifically, we model each missing view with a
distribution conditioned on the available views, thereby introducing
uncertainty. An evidence-based fusion strategy is then employed to guarantee
trustworthy integration of the imputed views. Extensive experiments are
conducted on multiple benchmark data sets, and our method establishes
state-of-the-art results in terms of both classification performance and
trustworthiness.
Comment: CVP
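The abstract's two ingredients (several sampled imputations for a missing view, then quality-aware use of those samples) can be sketched as below. The Gaussian conditional and the inverse-distance quality score are placeholder assumptions; UIMC itself uses an evidence-based fusion strategy, which is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_imputations(available_views, n_samples=5, sigma=0.1):
    """Characterize a missing view with a distribution conditioned on the
    available views (toy choice: Gaussian around their mean) and draw
    several samples instead of one deterministic imputation."""
    mu = np.mean(available_views, axis=0)
    return [rng.normal(mu, sigma) for _ in range(n_samples)], mu

def quality_weighted_fusion(samples, mu):
    """Weight each sampled imputation by a toy quality score (inverse
    distance to the conditional mean) so low-quality draws count less."""
    w = np.array([1.0 / (1.0 + np.linalg.norm(s - mu)) for s in samples])
    w /= w.sum()
    return sum(wi * s for wi, s in zip(w, samples))
```

The point of the sketch is the interface, not the model: imputation returns a set of draws with an attached notion of quality, and fusion consumes both, which is what makes the downstream integration controllable.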
UMIX: Improving Importance Weighting for Subpopulation Shift via Uncertainty-Aware Mixup
Subpopulation shift exists widely in many real-world machine learning
applications; it refers to training and test distributions that contain the
same subpopulation groups but differ in subpopulation frequencies. Importance
reweighting is a common way to handle the subpopulation shift issue by
imposing constant or adaptive sampling weights on each sample in the training
dataset. However, some recent studies have recognized that most of these
approaches fail to improve performance over empirical risk minimization,
especially when applied to over-parameterized neural networks. In this work,
we propose a simple yet practical framework, called uncertainty-aware mixup
(UMIX), to mitigate the overfitting issue in over-parameterized models by
reweighting the ''mixed'' samples according to sample uncertainty. A
training-trajectory-based uncertainty estimate is computed in UMIX for each
sample to flexibly characterize the subpopulation distribution. We also
provide insightful theoretical analysis to verify that UMIX achieves better
generalization bounds than prior works. Further, we conduct extensive
empirical studies across a wide range of tasks to validate the effectiveness
of our method both qualitatively and quantitatively. Code is available at
https://github.com/TencentAILabHealthcare/UMIX.
Comment: NeurIPS 202
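The mechanics described above (a trajectory-based uncertainty per sample, carried through to a weight on each mixed sample) can be sketched in a few lines. The exact estimator and weighting function are assumptions for illustration; the repository linked above contains the actual method.

```python
import numpy as np

def trajectory_uncertainty(correct_prob_history):
    """Training-trajectory-based uncertainty proxy: one minus the mean
    correct-class probability over past epochs (an assumed simple form;
    the paper's exact estimator may differ)."""
    return 1.0 - float(np.mean(correct_prob_history))

def umix_weight(u_i, u_j, lam):
    """Weight for a mixed sample: interpolate the two per-sample
    uncertainties with the mixing coefficient, so mixed samples built
    from uncertain (e.g. minority) examples are emphasized."""
    return lam * u_i + (1 - lam) * u_j

def umix_loss(per_sample_losses, weights):
    """Uncertainty-weighted training objective over mixed samples."""
    w = np.asarray(weights, dtype=float)
    return float(np.dot(w / w.sum(),
                        np.asarray(per_sample_losses, dtype=float)))
```

Samples the network fits late or inconsistently (high uncertainty, often from minority subpopulations) receive larger weights, which is how the reweighting counteracts overfitting to the majority groups.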
Reweighted Mixup for Subpopulation Shift
Subpopulation shift exists widely in many real-world applications; it refers
to training and test distributions that contain the same subpopulation groups
but with different subpopulation proportions. Ignoring subpopulation shift
may lead to significant performance degradation and fairness concerns.
Importance reweighting is a classical and effective way to handle
subpopulation shift. However, recent studies have recognized that most of
these approaches fail to improve performance, especially when applied to
over-parameterized neural networks, which are capable of fitting any training
samples. In this work, we propose a simple yet practical framework, called
reweighted mixup (RMIX), to mitigate the overfitting issue in
over-parameterized models by conducting importance weighting on the ''mixed''
samples. By leveraging reweighting within mixup, RMIX allows the model to
explore the vicinal space of minority samples more thoroughly, thereby
yielding a model that is more robust against subpopulation shift. When the
subpopulation memberships are unknown, a training-trajectory-based
uncertainty estimate is computed in RMIX to flexibly characterize the
subpopulation distribution. We also provide insightful theoretical analysis
to verify that RMIX achieves better generalization bounds than prior works.
Further, we conduct extensive empirical studies across a wide range of tasks
to validate the effectiveness of the proposed method.
Comment: Journal version of arXiv:2209.0892
Introduction to Special Issue - In-depth study of air pollution sources and processes within Beijing and its surrounding region (APHH-2 Beijing)
Abstract. The Atmospheric Pollution and Human Health in a Chinese Megacity (APHH-Beijing) programme is an international collaborative project focusing on understanding the sources, processes and health effects of air pollution in the Beijing megacity. APHH-Beijing brings together leading Chinese and UK research groups, state-of-the-art infrastructure and air quality models to work on four research themes: (1) sources and emissions of air pollutants; (2) atmospheric processes affecting urban air pollution; (3) air pollution exposure and health impacts; and (4) interventions and solutions. Themes 1 and 2 are closely integrated and support Theme 3, while Themes 1-3 provide scientific data for Theme 4 to develop cost-effective air pollution mitigation solutions. This paper provides an introduction to (i) the rationale of the APHH-Beijing programme and (ii) the measurement and modelling activities performed as part of it. In addition, this paper introduces the meteorology and air quality conditions during two joint intensive field campaigns, a core integration activity in APHH-Beijing. The coordinated campaigns provided observations of atmospheric chemistry and physics at two sites, (i) the Institute of Atmospheric Physics in central Beijing and (ii) Pinggu in rural Beijing, during 10 November – 10 December 2016 (winter) and 21 May – 22 June 2017 (summer). The campaigns were complemented by numerical modelling and by automatic air quality and low-cost sensor observations across the Beijing megacity. In summary, the paper provides background information on the APHH-Beijing programme and sets the scene for more focussed papers addressing specific aspects, processes and effects of air pollution in Beijing.
An interlaboratory comparison of aerosol inorganic ion measurements by ion chromatography : Implications for aerosol pH estimate
Water-soluble inorganic ions such as ammonium, nitrate and sulfate are major components of fine aerosols in the atmosphere and are widely used in the estimation of aerosol acidity. However, different experimental practices and instrumentation may lead to uncertainties in ion concentrations. Here, an intercomparison experiment was conducted in 10 different laboratories (labs) to investigate the consistency of inorganic ion concentrations and the resultant aerosol acidity estimates using the same set of aerosol filter samples. The results mostly exhibited good agreement for the major ions Cl-, SO42-, NO3-, NH4+ and K+. However, F-, Mg2+ and Ca2+ showed more variation across the different labs. The Aerosol Chemical Speciation Monitor (ACSM) data for non-refractory SO42-, NO3- and NH4+ generally correlated very well with the filter-analysis-based data in our study, but the absolute concentrations differed by up to 42 %. Cl- from the two methods was correlated, but the concentrations differed by more than a factor of 3. The analyses of certified reference materials (CRMs) generally showed good detection accuracy (DA) for all ions in all labs, the majority ranging between 90 % and 110 %. The DA was also used to correct the ion concentrations to showcase the importance of using CRMs for calibration checks and quality control. Better agreement was found for Cl-, SO42-, NO3-, NH4+ and K+ across the labs after their concentrations were corrected with DA; the coefficient of variation (CV) of Cl-, SO42-, NO3-, NH4+ and K+ decreased by 1.7 %, 3.4 %, 3.4 %, 1.2 % and 2.6 %, respectively, after DA correction. We found that the ratio of anion to cation equivalent concentrations (AE/CE) and the ion balance (anions minus cations) are not good indicators for aerosol acidity estimates, as the results from different labs did not agree well with each other.
In situ aerosol pH calculated with the ISORROPIA II thermodynamic equilibrium model from measured ion and ammonia concentrations showed a similar trend and good agreement across the 10 labs. Our results indicate that although there are important uncertainties in aerosol ion concentration measurements, the aerosol pH estimated from the ISORROPIA II model is more consistent.
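The DA correction and the AE/CE diagnostic used above are simple arithmetic; a small sketch follows, with illustrative numbers rather than the study's data.

```python
def detection_accuracy(measured_crm, certified_crm):
    """DA: measured CRM concentration as a percentage of the certified
    value (CRM = certified reference material)."""
    return 100.0 * measured_crm / certified_crm

def da_corrected(concentration, da_percent):
    """Correct a sample's ion concentration by the lab's DA for that ion."""
    return concentration / (da_percent / 100.0)

def anion_cation_ratio(anion_equivalents, cation_equivalents):
    """AE/CE: total anion over total cation equivalent concentration,
    one of the acidity indicators the study found unreliable across labs."""
    return sum(anion_equivalents) / sum(cation_equivalents)
```

For example, a lab whose CRM recovery for nitrate is 95 % would scale its reported nitrate concentrations up by 1/0.95 before cross-lab comparison.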
Uncertainty-Aware Multi-View Representation Learning
Learning from different data views by exploring the underlying complementary information among them can endow the representation with stronger expressive ability. However, high-dimensional features tend to contain noise, and furthermore, data quality usually varies across samples (and even across views), i.e., one view may be informative for one sample but not for another. Therefore, it is quite challenging to integrate noisy multi-view data in an unsupervised setting. Traditional multi-view methods either simply treat each view with equal importance or tune the weights of different views to fixed values, which is insufficient to capture the dynamic noise in multi-view data. In this work, we devise a novel unsupervised multi-view learning approach, termed Dynamic Uncertainty-Aware Networks (DUA-Nets). Guided by the uncertainty of the data, estimated from a generative perspective, intrinsic information from multiple views is integrated to obtain noise-free representations. With the help of uncertainty estimation, DUA-Nets weigh each view of an individual sample according to its data quality, so that high-quality samples (or views) are fully exploited while the effects of noisy samples (or views) are alleviated. Our model achieves superior performance in extensive experiments and shows robustness to noisy data.
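The core fusion idea (weight each view of a sample by its estimated quality rather than equally or by fixed weights) reduces to precision-style weighting. A toy NumPy sketch; in DUA-Nets the per-view uncertainties are learned from a generative perspective, whereas here they are simply given.

```python
import numpy as np

def fuse_views(view_features, view_uncertainties):
    """Fuse per-view representations of one sample, weighting each view
    inversely to its estimated uncertainty (toy precision weighting;
    a stand-in for the learned uncertainty in DUA-Nets)."""
    precision = 1.0 / np.asarray(view_uncertainties, dtype=float)
    w = precision / precision.sum()
    return sum(wi * np.asarray(v, dtype=float)
               for wi, v in zip(w, view_features))
```

With equal uncertainties this degenerates to plain averaging; as one view's uncertainty grows, its contribution shrinks, which captures the "informative for one sample but not another" behaviour at the sample level.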
Knowledge-guided machine learning reveals pivotal drivers for gas-to-particle conversion of atmospheric nitrate
Particulate nitrate, a key component of fine particles, forms through an intricate gas-to-particle conversion process. This process is regulated by the gas-to-particle conversion coefficient of nitrate (ε(NO3−)). The relationship between ε(NO3−) and its drivers is highly complex and nonlinear, and can be characterized by machine learning methods. However, conventional machine learning often yields results that lack clear physical meaning and may even contradict established physical/chemical mechanisms due to the influence of ambient factors. An alternative approach is urgently needed that offers transparent physical interpretations and provides deeper insights into what impacts ε(NO3−). Here we introduce a supervised machine learning approach: a theory-guided, multilevel nested random forest. Our approach robustly identifies NH4+, SO42−, and temperature as pivotal drivers of ε(NO3−). Notably, substantial disparities exist between the outcomes of traditional random forest analysis and the physically anticipated results. Furthermore, our approach underscores the significance of NH4+ during both daytime (30%) and nighttime (40%) periods, while appropriately downplaying the influence of some less relevant drivers compared with conventional random forest analysis. This research underscores the transformative potential of integrating domain knowledge with machine learning in atmospheric studies.
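The paper's multilevel nested random forest is not reproduced here; as a generic, model-agnostic illustration of how pivotal drivers can be screened, a permutation-importance sketch on a toy regressor (all names and data are assumptions, not the study's):

```python
import numpy as np

rng = np.random.default_rng(0)

def permutation_importance(model, X, y, col):
    """Drop in score (negative MSE) when one feature column is shuffled:
    a driver the model truly relies on loses predictive power, while an
    irrelevant column leaves the score unchanged."""
    base = -np.mean((model(X) - y) ** 2)
    Xp = X.copy()
    Xp[:, col] = rng.permutation(Xp[:, col])
    return base - (-np.mean((model(Xp) - y) ** 2))
```

In the study's setting, X would hold candidate drivers (e.g. NH4+, SO42−, temperature) and y the coefficient ε(NO3−); the knowledge-guided element additionally constrains the model with physical/chemical mechanisms, which plain importance measures like this one do not capture.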